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Module 5: Advanced Analytics - Technology and 
Tools 

Upon completion of this module, you should be able to: 

• Perform Analytics on Unstructured data using MapReduce 
Programming paradigm 

• Use Hadoop, HDFS, HIVE,PIG and other products in the Hadoop 
ecosystem for unstructured data analytics 

• Effectively use advanced SQL functions and Greenplum 
extensions for in-database analytics 

• Use MADlib to solve analytics problems in-database 
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Module 5: Advanced Analytics - Technology and Tools 

Lesson 3: In-database Analytics SQL essentials 


During this lesson the following topics are covered: 

• SQL Essentials 

• SET Operations 

• Online analytical processing (OLAP) features 

• GROUPING SETS, ROLLUP,CUBE 

• GROUPING, GROUPJD functions 

• Text processing, Pattern matching 
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Set Operations 

Greenplum supports the following set operations as part of a 
SELECT statement: 

• INTERSECT - Returns rows that appear in all answer sets 

• EXCEPT - Returns rows from the first answer set and excludes 
those from the second 

• UNION ALL - Returns a combination of rows from multiple 
SELECT statements with repeating rows 

• UNION - Returns a combination of rows from multiple SELECT 
statements with no repeating rows 
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Set Operations - intersect 


INTERSECT: 

• Returns only the rows that appear in both SQL queries 

• Removes duplicate rows 





\ 


Table A 

Intersect 







Table B 


N 



' 


SELECT t.transid, 
c . custname 

FROM facts . transaction t 

JOIN dimensions . customer c 

ON c.customerid = t.customerid 

INTERSECT 

SELECT tl.transid, 
cl . custname 

FROM facts . transaction tl 
JOIN dimensions . customer cl 

ON cl . customerid = tl . customerid 
WHERE tl. transdate BETWEEN 

'2008-01-01' AND '2008-01-21' 
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Set Operations - except 


EXCEPT: 

• Returns all rows from the first SELECT statement 


• Omits all rows that 
appear in the second 
SELECT statement 



SELECT t.transid, 
c . custname 

FROM facts . transaction t 

JOIN dimensions . customer c 

ON c.customerid = t.customerid 

EXCEPT 

SELECT tl.transid, 
cl . custname 

FROM facts . transaction tl 
JOIN dimensions . customer cl 

ON cl . customerid = tl . customerid 
WHERE tl. transdate BETWEEN 

'2008-01-01' AND '2008-01-21' 
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Set Operations - union all 


UNION ALL: 

• Combines rows from 
the first query with rows 
from the second query 

• Does not remove 
duplicates rows 



SELECT t.transid, 
c . custname 

FROM facts . transaction t 
JOIN dimensions . customer c 

ON c.customerid = t.customerid 

WHERE t . transdate BETWEEN 

'2008-01-01' AND '2008-05-17' 

UNION ALL 

SELECT tl.transid, 
cl . custname 

FROM facts . transaction tl 
JOIN dimensions . customer cl 

ON cl . customerid = tl . customerid 

WHERE tl. transdate BETWEEN 

'2008-01-01' AND '2008-01-21' 
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Set Operations - union 


UNION: 


• Combines rows from 


SELECT 


the first query with rows 
from the second query 

• Removes duplicates or 
repeating rows 


FROM 

JOIN 

ON 

WHERE 


Table A 


Table B 

Results of 
second query 

Results of 
first query 





'i 


Duplicat &^ ZZZZZZZZZZZ Z ; . 


ZZZZZZZZZZZZ 


?///////////; 


Table A plus Table B minus duplicates 


UNIO 

SELECT 


FROM 

JOIN 

ON 

WHERE 
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t . transid, 

c . custname 

facts . transaction t 

dimensions . customer c 

c.customerid = t.customerid 

t . transdate BETWEEN 

'2008-01-01' AND '2008-05-17' 


tl . transid, 

cl . custname 

facts . transaction tl 

dimensions . customer cl 

cl . customerid = tl . customerid 

tl. transdate BETWEEN 

'2008-01-01' AND '2008-01-21' 
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SET Operations 



•Types of Join 

• Inner 

• Left outer 

• Right outer 

• Full outer 

• Cross 


GD 


GD 


GD 

Inner Join 


Left Outer Join 


Right Outer Join 


GD 


Full Outer Join 


o-o 


Cross Join 
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p 


Left Outer Join 

• Correlated sub-queries do not run efficiently in Greenplum 
though support has been introduced in Version 4.2 

► SELECT * FROM transaction t 
WHERE NOT EXISTS ( 

SELECT 1 FROM customer c 
WHERE c.customerid = t.customerid) 

• Use LEFT OUTER JOIN 

► SELECT t.* FROM transaction t 
LEFT OUTER JOIN 

customer c ON t.customerid=c.customerid 
WHERE c.customerid IS NULL 
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Sub-query vs. Inner Join 

• IN clause is fully supported ... 

► SELECT* 

FROM transaction t 
WHERE t.customerid IN 
(SELECT customerid FROM customer) 


• However, generally better idea as long as c. customerid is unique: 

► SELECT t.* 

FROM transaction t 
INNER JOIN customer c 
ON c.customerid = t.customerid 
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Greenplum SQL OLAP Grouping Extensions 

Greenplum supports the following grouping extensions: 

• Standard GROUP BY 

• ROLLUP 

• GROUPING SETS 

• CUBE 

• grouping (column [ , . . . ] ) function 

• group id() function 
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Standard group by Example 


GROUP BY: 

• Group results based on one or more specified columns 

• Is used with aggregate statements 
The following example summarizes 




product sales by vendor: 

SELECT pn, vn, sum (prc*qty) 
FROM sale 

GROUP BY pn, vn 

ORDER BY 1,2,3; 





pn | 

vn | 

sum 

— 

-+ 

-+ 

100 

20 

0 

100 

40 

2640000 

200 

10 

0 

200 

40 

0 

300 

30 

0 

400 

50 

0 

500 

30 

120 

600 

30 

60 

700 

40 

1 

800 
(10 2 

40 
rows ) 

1 
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Standard GROUP BY Example with union all 


This example extends the previous I 



(19 rows) 
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rollup Example 


The following example meets the 
requirement where the sub-total 
and grand totals are to be included: 

SELECT pn, vn, sum (prc*qty) 

FROM sale 

GROUP BY ROLLUP (pn , vn) 

ORDER BY 1,2,3; 


pn | vn | sum 


100 

1 

20 

1 

0 

100 

1 

40 

1 

2640000 

100 

1 


1 

2640000 

200 

1 

10 

1 

0 

200 

1 

40 

1 

0 

200 

1 


1 

0 

300 

1 

30 

1 

0 

300 

1 


1 

0 

400 

1 

50 

1 

0 

400 

1 


1 

0 

500 

1 

30 

1 

120 

500 

1 


1 

120 

600 

1 

30 

1 

60 

600 

1 


1 

60 

700 

1 

40 

1 

1 

700 

1 


1 


800 

1 

40 

1 

1 


1 


1 

_ 

| | 2640182 


(19 rows) 



grouping sets Example 


The following example shows how 
to achieve the same results with the 

GROUPING SETS clause: 



pn | vn | sum 


100 

1 

20 

1 

0 

100 

1 

40 

1 

2640000 

100 

1 


1 

2640000 

200 

1 

10 

1 

0 

200 

1 

40 

1 

0 

200 

1 


1 

0 

300 

1 

30 

1 

0 

300 

1 


1 

0 

400 

1 

50 

1 

0 

400 

1 


1 

0 

500 

1 

30 

1 

120 

500 

1 


1 

120 

600 

1 

30 

1 

60 

600 

1 


1 

60 

700 

1 

40 

1 

1 

700 

1 


1 


800 

1 

40 

1 

1 


1 


1 

_ 

| | 2640182 


(19 rows) 




cube Example 


CUBE creates subtotals for all possible 
combinations of grouping columns. 

The following example 

SELECT pn, vn, sum (prc*qty) 

FROM sale 

GROUP BY CUBE (pn, vn) 

ORDER BY 1,2,3; 


is the same as 

SELECT pn, vn, sum (prc*qty) 
FROM sale 

GROUP BY GROUPING SETS 
( (pn, vn) , (pn) , 
(vn) , () ) 

ORDER BY 1,2,3; 


pn | vn | sum 

100 | 20 | 0 

100 I 40 | 2640000 

100 | | 2640000 


200 | 10 | 0 
200 I 40 I 0 



(24 rows) 
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grouping Function Example 

Grouping distinguishes NULL from summary markers. 


store 


product 


-+- 


SELECT 

store, customer, product, 
sum (price) , 

grouping (customer) 

FROM dsales_null 
GROUP BY 

ROLLUP (store, customer, 
product) ; 


price 


SELECT * FROM dsales null; 


store 


I 


customer | product 


sum 


-+- 


-+- 


-+- 


• + 


c2 

p2 

70 

c2 


70 

c3 

pi 

40 

c3 


40 



110 

cl 

pi 

90 

cl 

p2 

50 

cl 


140 


pi 

44 



44 



184 



294 


(12 rows) 


grouping 
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GROUP id Function 



GROUP_ID: 

• Returns 0 for each output row in a unique grouping set 

• Assigns a serial number >0 to each duplicate grouping set found 

• Can be used to filter output rows of duplicate grouping sets, such 
as in the following example: 


SELECT a, b, c, sum(p*q), group_id() 
FROM sales 

GROUP BY ROLLUP (a, b) , CUBE(b,c) 
having group_id()<l 

ORDER BY a , b , c ; 
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In-database Text Analysis 

• SQL features for 

► Text handling functions 

► Pattern matching with regular expressions 

• Example Use-cases 

► Filter emails with spam tag in subject 

► Extract domains from a URL 

► Extract all URLs from a HTML file 

► Check for Syntactically correct email addresses 

► Convert 10 digits into format "(123) 456-7890" 
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Pattern Matching - Regular Expressions (Regex) 

Regular Expression match Operators 


Operator 

Description 

Example 


Case sensitive substring 

‘Greenplum’ - tA Green’ 


Case-insensitive substring 

‘Greenplum’ ~*’ee+’ 


SQL Functions 

substring(s//7>7^ from, pattern [for escape]) 
regexp_matches(s//7>7d7, pattern , [flags]) 
regexp_rep\ace{strlng, pattern , rep/, [flags]) 
regexp_split_to_{array|table} 
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Regular Expression Quantifiers 


• Quantifier 


Expression 

Matches 

■ 

Arbitrary character 

A And $ 

Virtual characters for beginning and 
end 

* 

Preceding item zero or more times 

+ 

Preceding item one or more times 

? 

■ 

Preceding item is optional 

{n} 

Preceding item n times 

a|b 

Item a or b 

■ ■ ■ 

■ ■ ■ 
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Check Your Knowledge 

Your Thoughts? 

1 . How would you use GROUPING SETS to produce the same 
results as the following GROUP BY CUBE? 

SELECT state, productID, SUM(volume) FROM sales GROUP BY 
CUBE (state, productID) ORDER BY state, productID 

2 . How would you show the sub-totals for each week, for each 
state, and for each product? (No other totals or grand totals 
are required.) Suppose the table structure is 

TABLE sales (productID VARCHAR, state CHAR(2), week DATE, 
volume INT) 

3 . Discuss the utility of grouping and groupjd functions 
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Check Your Knowledge (Continued) 



Your Thoughts? 


4. Give regular expressions for the following: 

► A regex that, given a URL, captures the domain name 

► A regex that captures PostgreSQL Dollar-quoted String literals 
Examples: 

►► $test$This is a string$test$ 
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Lesson 3: Summary 


During this lesson the following SQL Essentials topics were 
covered: 

Online analytical processing (OLAP) features 

• GROUPING SETS, ROLLUP, CUBE 

• GROUPING, GROUPJD functions 
Text processing, Pattern matching 
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Module 5: Advanced Analytics - Technology and 
Tools 

Lesson 4: Advanced SQL and MADlib 


During this lesson the following topics are covered: 
Advanced SQL and MADlib: 

• Window functions 

• User defined functions and aggregates 

• Ordered Aggregates 

• MADlib 
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Window Functions 



• About Window Functions 

► Returns a value per row, unlike aggregate functions 

► Has its results interpreted in terms of the current row and its 
corresponding window partition or frame 

► Is characterized by the use of the OVER clause 

► Defines the window partitions, or groups of rows to apply 
the function 

► Defines ordering of data within a window 

► Defines the positional or logical framing of a row with 
respect to its window 
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Window Functions (Continued) 

• About Window Functions 

► Returns a value per row, unlike aggregate functions 

► Has its results interpreted in terms of the current row and its 
corresponding window partition or frame 

► Is characterized by the use of the OVER clause 

► Defines the window partitions, or groups of rows to apply 
the function 

► Defines ordering of data within a window 

► Defines the positional or logical framing of a row with 
respect to its window 
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Defining Window Specifications (OVER Clause) 

When defining the window function: 

► Include an OVER ( ) clause 

► Specify the window of data to which the function applies 
• Define: 

► Window partitions, using the PARTITION BY clause 

► Ordering within a window partition, using the ORDER BY clause 

► Framing within a window partition, using ROWS and RANGE 
clauses 

► The ORDER BY clause also defines a frame of unbounded 
preceding to current in the partition 
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About the partition by Clause 



The PARTITION BY clause: 

• Can be used by all window functions 

• Organizes result sets into groupings based on unique values 

• Allows the function to be applied to each partition 
independently 



Note: If the PARTITION BY clause is omitted, the entire 
result set is treated as a single window partition. 
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Window Partition Example 


SELECT * , 
row_number ( ) 
OVER ( ) 

FROM sale 
ORDER BY cn; 


SELECT * , 
row_number ( ) 

OVER (PARTITION 
BY cn) 

FROM sale 
ORDER BY cn; 


row number 

| 

cn 

1 vn 

1 pn 

| dt 

1 qty 

1 pro 

— 

- + 

— 

i 

i 

i 

i 

+ 

i 

i 

i 

i 

i 

i 

+ 

i 

i 

i 

i 

i 

i 

i 

i 

i 

i 

i 

i 

+ 

i 

i 

i 

i 

i 

+ 

i 

i 

i 

i 

+ 

1 

1 

i 

| 10 

200 

| 1401-03-01 

1 1 

l o 

2 

1 

i 

I 30 

300 

| 1401-05-02 

1 1 

l o 

3 

1 

i 

| 50 

400 

| 1401-06-01 

1 1 

I o 

4 

1 

i 

| 30 

500 

| 1401-06-01 

| 12 

1 5 

5 

1 

i 

| 20 

100 

| 1401-05-01 

1 1 

I o 

6 

1 

2 

| 50 

400 

| 1401-06-01 

1 1 

I o 

7 

1 

2 

| 40 

100 

| 1401-01-01 

| 1100 

I 2400 

8 

(8 rows) 

1 

3 

| 40 

200 

| 1401-04-01 

1 1 

1 o 


row number 

| cn 

| vn | pn | dt | qty 

| pro 

1 

i i 

| 10 

200 

1401-03-01 

1 

l o 

2 

i i 

| 30 

300 

1401-05-02 

1 

1 o 

3 

i i 

| 50 

400 

1401-06-01 

1 

1 o 

4 

i i 

| 30 

500 

1401-06-01 

12 

1 5 

_5 

i i 

1 20 

100 

1 401 -05-01 

1 

1 o 


400 | 1401-06-01 
100 I 1401-01-01 


40 


200 


1401-04-01 


(8 rows) 



Copyright © 2014 EMC Corporation. All Rights Reserved. 


Module 5: Advanced Analytics - Technology and Tools 82 



RANK and ORDER BY 



The ORDER BY clause: 

• Can always be used by window functions 

• Is required by some window functions such as RANK 

• Specifies ordering within a window partition 

The RANK built-in function: 

• Calculates the rank of a row 

• Gives rows with equal values for the specified criteria the same 
rank 
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Using the over (order by 


SELECT vn, 

sum (prc*qty) 

FROM sale 


GROUP BY vn 


ORDER BY 2 

DESC; 


SELECT vn, sum(prc*qty) , rank() 
OVER (ORDER BY sum (prc*qty) 
DESC) 

FROM sale 
GROUP BY vn 
ORDER BY 2 DESC; 


) Clause 


vn 

| 

sum 




40 

1 

2640002 

30 

1 

180 

50 

1 

0 

20 

1 

0 

10 

1 

0 

(5 

rows ) 


i 


vn | sum | 


40 

30 

50 

20 

10 


2640002 

180 

0 

0 

0 


(5 rows) 


rank 

1 

2 

3 

3 

3 





Designating a Sliding (Moving) Window 

A moving window: 

• Defines a set or rows in a window partition 

• Allows you to define the first row and last row 

• Uses the current row as the reference point 

• Can be expressed in rows with the ROWS clause 

• Can be expressed as a range with the RANGE clause 
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Designating a Sliding (Moving) Window (Continued) 

A moving window: 

• Defines a set or rows in a window partition 

• Allows you to define the first row and last row 

• Uses the current row as the reference point 

• Can be expressed in rows with the ROWS clause 

• Can be expressed as a range with the RANGE clause 
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Window Framing Example 


A rolling window moves through 
a partition of data, one row at a 
time. 

SELECT vn, dt, prc * qty 
ma = AVG (prc*qty) 

OVER (PARTITION BY vn 
ORDER BY dt 
ROWS BETWEEN 
2 PRECEDING AND 
2 FOLLOWING) 

FROM sale; 



1 

1 prc * 

1 

vn 

1 dt 

-|_ 

1 qty 

_|_ 

1 ma 

10 

| 03012008 

| 30 



1 30 

20 

] 05012008 

I 20 

| 20 

30 

| 05022008 

| 60 

| 50 

30 

| 05042008 

| 30 

I 67.5 

30 

| 05092008 

| 60 

1 67.5 

30 

| 05142008 

| 120 

1 70 

40 

| 06012008 

1 12 0 

1 110 

40 

| 06042008 

1 90 

| 127 . 5 

40 

| 06052008 

| 12 0 

1 127 . 5 

40 

| 06052008 

| 180 

1 130 

50 

| 06012008 

| 30 

1 20 

50 

| 06012008 

| 10 

1 20 

(12 

rows) 
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Window Framing Example (Continued) 


(Tl 


vn 

1 dt 

-l- 

1 prc * qty 

- + 

30 

| 05022008 

I 60 


30 

05042008 

30 


30 

| 05092008 

| 60 



30 | 05142008 | 120 



-+ 

— +- 






30 

| 05042008 

1 

30 


30 

| 05092008 

1 

60 


30 

t 05142008 

I 

120 




vn I dt 

+ 


v 


prc * qty 


-+■ 


30 

| 05022008 

| 60 

30 

| 05042008 

| 30 

30 

05092008 

60 

30 

1 05142008 

1 120 


I 


vn I dt 

+ 


prc 


qty 


-+- 


30 

| 05022008 

[ 60 

30 

| 05042008 

| 30 

30 

| 05092008 

1 60 

30 

I 05142008 

1 120 



Copyright © 2014 EMC Corporation. All Rights Reserved. Module 5: Advanced Analytics - Technology and Tools 88 



General Syntax of Window Function 



A moving window: 

• Is defined as part of a window with the ORDER BY clause as 
follows: 

WINDOW window_name AS (window_specif ication) 

where window_specification can be: 

[ window_name ] 

[PARTITION BY expression [ , ...]] 

[ORDER BY expression [ASC | DESC | USING operator] [ , ...] 

[ { RANGE | ROWS } 

{ UNBOUNDED PRECEDING 
| expression PRECEDING 
| CURRENT ROW 

| BETWEEN window frame bound AND window frame bound } ] ] 
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Built-In Window Functions 


Built-In Function 

Description 

dist ( ) 

Calculates the cumulative distribution of a value in a group 
of values. Rows with equal values always evaluate to the 
same cumulative distribution value. 

dense rank () 

Computes the rank of a row in an ordered group of rows 
without skipping rank values. Rows with equal values are 
given the same rank value. 

first value {expr) 

Returns the first value in an ordered set of values. 

lag(expr [, offset] 

Provides access to more than one row of the same table 

[ , default ] ) 

7 — i * 1 irrr 

without doing a self join. Given a series of rows returned 
from a query and a position of the cursor, LAG provides 
access to a row at a given physical offset prior to that 
position. If offset is not specified, the default offset is 1 . 
default sets the value that is returned if the offset goes 
beyond the scope of the window. If default is not specified, 
the default value is null. 

T m n v?, V / , ir n n l!v /YiL, L -i H ^ n - ^1 V. 7 1 TT? TITO / 'W U' U f ' li v,i n ill r A 


can also be used as a window function. 
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Built-In Window Functions (Continued) 


Built-In Function Description 


last value {expr) 

Returns the last value in an ordered set of values. 

lead ( ) 

Provides access to more than one row of the same table 
without doing a self join. Given a series of rows returned 
from a query and a position of the cursor, LEAD provides 
access to a row at a given physical offset after that 
position. If offset is not specified, the default offset is 1 . 
default sets the value that is returned if the offset goes 
beyond the scope of the window. If default is not specified, 
the default value is null. 

ntile (expr) 

Divides an ordered dataset into a number of buckets (as 
defined by expr) and assigns a bucket number to each 
row. 

percent rank() 

Calculates the rank of a hypothetical row R minus 1 , 
divided by 1 less than the number of rows being evaluated 
(within a window partition). 

row number ( ) 

Assigns a unique number to each row to which it is applied 
(either each row in a window partition or each row of the 
query). 
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Check Your Knowledge 

• Describe how this code will work: 

► SELECT dt, region, revenue, 

count(*) OVER (twdw) AS moving_count, 
avg(revenue) OVER (twdw) AS moving_average 
FROM moving_average_data mad 
WINDOW twdw AS (PARTITION BY region 
ORDER BY dt RANGE BETWEEN 
7 days': interval PRECEDING AND 
'0 days': interval FOLLOWING) 

ORDER BY region, dt 



Your Thoughts? 
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User Defined Functions and Aggregates 



Greenplum supports several function types, including: 

• Query language functions where the functions are written in SQL 

• Procedural language functions where the functions are written in: 

► PL/pgSQL 

► PL/TcL 

► Perl 

► Python 

► R 

• Internal functions 

• C-language functions 

• Use Case examples: 

► Second largest element in a column? 

► Online auction: Who is the second highest bidder? 
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Anatomy of a User-Defined Function 



Example: 


CREATE FUNCTION times2(INT) 
RETURNS INT 
AS@ 


Start function body 


SELECT 2 * $1 ♦ 
$\ANGUAGE sql 


©- 


Function body 


End function body 


SELECT times2(1); 
times2 


2 

(1 row) 
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User-Defined Aggregates 



• Perform a single table scan 

• Example: Second largest number 

► Keep a state: maximum 2 numbers 

► New number can displace the smaller one in the state 

► Greenplum extension: Merge two states 

• Example :Create a sum of cubes aggregate: 

CREATE FUNCTION scube_accum(numeric, numeric) RETURNS numeric 
AS 'select $1 + $2 * $2 * $2' 

LANGUAGE SQL 
IMMUTABLE 

RETURNS NULL ON NULL INPUT; 


CREATE AGGREGATE scube(numeric) ( 
SFUNC = scubeaccum, 

STYPE = numeric, 

INITCOND = 0 ); 
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Ordered Aggregates 

• Output of aggregates may depend on order 

► Example: 

SELECT array_agg(letter) FROM alphabet 

► SQL does not guarantee a particular order 

► Output could be {a,b,c} or {b,c,d} or ... depending on query optimizer, 
distribution of data, ... 

• Sample Use Case: 

►► Maximum value of discrete derivative? For example: 

Largest single-day stock increase during last year? 

• Greenplum 4.1 introduces ordered aggregates: 

► SELECT array_agg(c*o/4//77/7 ORDER BY expression [ASC|DESC]) 
FROM table 

• Median can be implemented using an ordered call of array_agg() 

► This will be covered in the Lab 
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MADlib: Definition 



• MAD stands for: 





• lib stands for library of: 

• advanced (mathematical, statistical, machine learning) 

• parallel & scalable 

• in-database functions 


• Mission: to foster widespread development of scalable analytic 
skills, by harnessing efforts from commercial practice, 
academic research, and open-source development. 
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MADlib: Getting Help... 


Check out the user guide with examples at: http: //doc, mad lib, net 


'/ i MADlib: Main Page - Windows Internet Explorer^J 




http://doc.madlib.net/v0.3/ 


▼ jgj | X [ P Search Results 

P -[ 

i File Edit View Favorites Tools Help 


^ Favorites ^ HJ Suggested Sites ▼ ffj Free Hotmail 


) MADlib: Main Page 


□ ▼ [3 ▼ Page ▼ Safety ▼ Tools ▼ © 


MADlib 0.3 

User Documentation 


Main Page 

► Modules 

► File List 

File Members 


MADlib Documentation 


The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development. 

Useful links: 

• MADlib project site http://madlib.net/ 

• MADlib bug reporting site: http://jira.madlib.net/ and quick guide: https://github.com/madlib/madlib/wiki/Bug-reporting 

Please refer to the Read-Me file for information about incorporated third-party material. License information regarding MADlib and included third-party libraries can be found inside the license directory. 


1 Generated on Fri Apr 27 2012 21:49:23 for MADlib by 1-7.5.1 

Dope 


^ Internet | Protected Mode: On 

^ ^100% - 

E ei n ^ ^ r» 


r \ 98% ) a • o) e -v <e e « r? i 

...R A* ^ 1:21 PM | 


• Need more help? 

Try: http://qroups.aooale.com/qroup/madlib-user-forum 
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Greenplum In-database Analytical Functions 


Descriptive Statistics 

Modeling 

Quantile 

Association Rule Mining 

Profile 

K-Means Clustering 

CountMin (Cormode-Muthukrishnan) Sketch- 
based Estimator 

NaYve Bayes Classification 

FM (Flajolet-Martin) Sketch-based Estimator 

Linear Regression 

MFV (Most Frequent Values) Sketch-based 
Estimator 

Logistic Regression 

Frequency 

Support Vector Machines 

Histogram 

SVD Matrix Factorization 

Bar Chart 

Decision Trees/CART 

Box Plot Chart 

Neural Networks 

Correlation Matrix 

Parallel Latent Dirichlet Allocation 
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